05 Machine Learning



With the data preparation complete, this step demonstrates how to configure a scikit-learn or dask_ml pipeline; any library, algorithm, or simulator that accepts array data could be used at this stage. In the next step of the tutorial, Data Visualization, you will learn how to visualize the output of this pipeline and how to diagnose and verify that its inputs have the expected structure.

Recap: Loading data

In [1]:
import intake

cat = intake.open_catalog('../catalog.yml')
l5_da = cat.l8.read_chunked()

Subsetting data

To speed things up for tutorial purposes, we'll use a subset of these data in the following examples. There are many ways to subset data in xarray. Here we select the central fifth of the data along each dimension using index selection, which gives roughly 1500 x 1500 pixels.

In [2]:
nbands, ny, nx = l5_da.shape
bounds = int(2*ny/5), int(3*ny/5), int(2*nx/5), int(3*nx/5)
bounds
Out[2]:
(3176, 4764, 3128, 4692)
In [3]:
l5_da = l5_da[:, bounds[0]:bounds[1], bounds[2]:bounds[3]]
l5_da
Out[3]:
<xarray.DataArray (band: 7, y: 1588, x: 1564)>
dask.array<shape=(7, 1588, 1564), dtype=int16, chunksize=(1, 152, 200)>
Coordinates:
  * y        (y) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.283e+06 4.283e+06
  * x        (x) float64 3.371e+05 3.372e+05 3.372e+05 ... 3.84e+05 3.84e+05
  * band     (band) int64 1 2 3 4 5 6 7
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)
In [4]:
import hvplot.xarray

l5_da.hvplot(kind='image', x='x', y='y', groupby='band', datashade=True, width=450, height=400, cmap='greys')
Out[4]:

Reshaping Data

We'll need to reshape the image to the layout that dask-ml / scikit-learn expect: (n_samples, n_features), where n_features is the number of bands and n_samples is the total number of pixels in each band. Essentially, we'll be creating a bag of pixels out of the image, where each pixel has multiple features (bands) but the ordering of the pixels is no longer relevant. In this case we start with an array that is n_bands by n_y by n_x (7, 1588, 1564) and we need to reshape to an array that is (n_samples, n_features), i.e. (2483632, 7). We'll first look at doing this with NumPy, then with xarray.
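Before walking through the real data, the target transformation can be sketched on a tiny synthetic array (the shapes here are made up purely for illustration):

```python
import numpy as np

# Synthetic stand-in for a (n_bands, n_y, n_x) image
n_bands, n_y, n_x = 7, 4, 5
image = np.arange(n_bands * n_y * n_x).reshape(n_bands, n_y, n_x)

# Flatten the spatial dimensions, then move bands to the last axis
samples = image.reshape(n_bands, -1).T

print(samples.shape)  # (20, 7): n_samples x n_features
```

Each row of `samples` is one pixel, and each column is that pixel's value in one band.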

Numpy

Data can be reshaped at the lowest level using NumPy, by getting the underlying values from the xarray.DataArray and using flatten and transpose to get the right shape.

In [5]:
import numpy as np
In [6]:
arr = l5_da.values
arr.shape
Out[6]:
(7, 1588, 1564)

Since we want to flatten along the x and y axes but not along the band axis, we need to iterate over each band and flatten its data.

In [7]:
flattened = np.array([arr[i].flatten() for i in range(arr.shape[0])])
flattened
Out[7]:
array([[1094, 1083, 1095, ...,  988, 1015, 1063],
       [1304, 1308, 1330, ..., 1179, 1202, 1254],
       [1727, 1751, 1780, ..., 1625, 1638, 1689],
       ...,
       [2554, 2555, 2605, ..., 2548, 2573, 2654],
       [2923, 2933, 2977, ..., 3095, 3108, 3314],
       [2734, 2742, 2768, ..., 2190, 2282, 2574]], dtype=int16)
In [8]:
flattened.shape
Out[8]:
(7, 2483632)
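The per-band loop above is equivalent to a single `reshape`, which is typically faster since it avoids copying band by band; a quick check on a small synthetic array (values are for illustration only):

```python
import numpy as np

arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # (n_bands, n_y, n_x)

# Loop-based flattening, as in the cell above
looped = np.array([arr[i].flatten() for i in range(arr.shape[0])])

# Single reshape: keep the band axis, collapse the rest
reshaped = arr.reshape(arr.shape[0], -1)

print(np.array_equal(looped, reshaped))  # True
```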

We can reorder the dimensions using .transpose:

In [9]:
sample_by_feature = flattened.transpose()
sample_by_feature
Out[9]:
array([[1094, 1304, 1727, ..., 2554, 2923, 2734],
       [1083, 1308, 1751, ..., 2555, 2933, 2742],
       [1095, 1330, 1780, ..., 2605, 2977, 2768],
       ...,
       [ 988, 1179, 1625, ..., 2548, 3095, 2190],
       [1015, 1202, 1638, ..., 2573, 3108, 2282],
       [1063, 1254, 1689, ..., 2654, 3314, 2574]], dtype=int16)
In [10]:
sample_by_feature.shape
Out[10]:
(2483632, 7)

Since numpy.arrays are not labeled data, the semantics of the data are lost over the course of these operations: the necessary metadata simply does not exist at the NumPy level.
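In NumPy the reshape is invertible only if you carry the original shape around yourself, which is exactly the bookkeeping NumPy does not do for you (synthetic array, for illustration):

```python
import numpy as np

original = np.arange(7 * 3 * 4).reshape(7, 3, 4)       # (n_bands, n_y, n_x)
sample_by_feature = original.reshape(7, -1).T          # (n_samples, n_features)

# Inverting requires remembering original.shape manually
restored = sample_by_feature.T.reshape(original.shape)

print(np.array_equal(original, restored))  # True
```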

Xarray

By using xarray methods to flatten the data, we can keep track of the coordinate labels ('x' and 'y') along the way. This means that we can reshape back to our original array at any time with no information loss.

In [11]:
flattened_by_band = l5_da.stack(z=('x','y'))
flattened_by_band
Out[11]:
<xarray.DataArray (band: 7, z: 2483632)>
dask.array<shape=(7, 2483632), dtype=int16, chunksize=(1, 44464)>
Coordinates:
  * band     (band) int64 1 2 3 4 5 6 7
  * z        (z) MultiIndex
  - x        (z) float64 3.371e+05 3.371e+05 3.371e+05 ... 3.371e+05 3.371e+05
  - y        (z) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.33e+06 4.33e+06
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)

We can reorder the dimensions using DataArray.transpose:

In [12]:
sample_by_feature = flattened_by_band.transpose('z', 'band')
sample_by_feature
Out[12]:
<xarray.DataArray (z: 2483632, band: 7)>
dask.array<shape=(2483632, 7), dtype=int16, chunksize=(44464, 1)>
Coordinates:
  * band     (band) int64 1 2 3 4 5 6 7
  * z        (z) MultiIndex
  - x        (z) float64 3.371e+05 3.371e+05 3.371e+05 ... 3.371e+05 3.371e+05
  - y        (z) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.33e+06 4.33e+06
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)

Now we have the data in the shape we are looking for: a long array of pixels with one column per band. As a sanity check, we can take a look at the underlying np.array:

In [13]:
X = sample_by_feature.values
X
Out[13]:
array([[1094, 1304, 1727, ..., 2554, 2923, 2734],
       [1098, 1315, 1727, ..., 2523, 2885, 2690],
       [1084, 1304, 1710, ..., 2505, 2889, 2689],
       ...,
       [1069, 1267, 1718, ..., 2669, 3283, 2511],
       [1055, 1242, 1697, ..., 2668, 3284, 2542],
       [1063, 1254, 1689, ..., 2654, 3314, 2574]], dtype=int16)

Other preprocessing

Sometimes values are too big, need more axes, or need an affine transformation applied. Here we'll demonstrate doing this in numpy and xarray.
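As a minimal sketch of the affine case, a gain-and-offset transform (the `gain` and `offset` values below are made-up calibration coefficients, of the kind used to convert digital numbers to reflectance) is just elementwise arithmetic in NumPy:

```python
import numpy as np

# A few raw int16 pixel values, as in the arrays above
X = np.array([[1094, 1304],
              [1083, 1308]], dtype=np.int16)

gain, offset = 0.0001, -0.1   # hypothetical calibration coefficients
reflectance = gain * X.astype(np.float64) + offset

print(reflectance)
```

The same expression works unchanged on an xarray.DataArray, which broadcasts and preserves coordinates automatically.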

Add an axis:

In [14]:
np.expand_dims(X, 2).shape
Out[14]:
(2483632, 7, 1)
In [15]:
sample_by_feature.expand_dims(dim='e', axis=2)
Out[15]:
<xarray.DataArray (z: 2483632, band: 7, e: 1)>
dask.array<shape=(2483632, 7, 1), dtype=int16, chunksize=(44464, 1, 1)>
Coordinates:
  * band     (band) int64 1 2 3 4 5 6 7
  * z        (z) MultiIndex
  - x        (z) float64 3.371e+05 3.371e+05 3.371e+05 ... 3.371e+05 3.371e+05
  - y        (z) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.33e+06 4.33e+06
Dimensions without coordinates: e
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)
In [16]:
# Exercise: Try removing the extra axis using np.squeeze or .squeeze on the xarray object

Rescale:

Rescale (standardize) the data before feeding it to the algorithm, since the ML pipeline we have selected expects input values to be small.

In [17]:
(X - X.mean()) / X.std()
Out[17]:
array([[-0.84910776, -0.60315855, -0.10774657, ...,  0.86082485,
         1.29299275,  1.07163846],
       [-0.84442301, -0.5902755 , -0.10774657, ...,  0.82451806,
         1.24848765,  1.02010624],
       [-0.86081963, -0.60315855, -0.12765674, ...,  0.8034367 ,
         1.2531724 ,  1.01893506],
       ...,
       [-0.87838743, -0.64649246, -0.11828725, ...,  0.99551132,
         1.71461997,  0.81046382],
       [-0.89478404, -0.67577213, -0.14288217, ...,  0.99434013,
         1.71579115,  0.84677061],
       [-0.88541455, -0.66171788, -0.15225166, ...,  0.97794352,
         1.75092675,  0.88424858]])
In [18]:
rescaled = (sample_by_feature - sample_by_feature.mean()) / sample_by_feature.std()
rescaled.compute()
Out[18]:
<xarray.DataArray (z: 2483632, band: 7)>
array([[-0.849108, -0.603159, -0.107747, ...,  0.860825,  1.292993,  1.071638],
       [-0.844423, -0.590275, -0.107747, ...,  0.824518,  1.248488,  1.020106],
       [-0.86082 , -0.603159, -0.127657, ...,  0.803437,  1.253172,  1.018935],
       ...,
       [-0.878387, -0.646492, -0.118287, ...,  0.995511,  1.71462 ,  0.810464],
       [-0.894784, -0.675772, -0.142882, ...,  0.99434 ,  1.715791,  0.846771],
       [-0.885415, -0.661718, -0.152252, ...,  0.977944,  1.750927,  0.884249]])
Coordinates:
  * band     (band) int64 1 2 3 4 5 6 7
  * z        (z) MultiIndex
  - x        (z) float64 3.371e+05 3.371e+05 3.371e+05 ... 3.371e+05 3.371e+05
  - y        (z) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.33e+06 4.33e+06

NOTE: Since the xarray object is backed by dask, the actual computation isn't performed until .compute() is called.

In [19]:
# Exercise: Inspect the numpy array at rescaled.values to check that it matches the numpy array above.
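Note that the cells above standardize with a single global mean and standard deviation across all bands. A common alternative is to standardize each band (feature) independently, which in NumPy is a reduction over the samples axis (tiny made-up matrix for illustration):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])  # (n_samples, n_features)

# Per-feature standardization: each column gets zero mean, unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # [1, 1]
```

Which variant is appropriate depends on whether the bands share a common scale; here the bands are all int16 radiances, so a global rescale is a reasonable choice.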

ML pipeline

The machine learning pipeline shown below is just for the purpose of understanding the shaping/reshaping of the data; in practice you will likely use a more sophisticated pipeline. Here we use SpectralClustering from dask_ml, a scalable equivalent of the scikit-learn version, to cluster pixels based on similarity across all bands (which makes it spectral clustering by spectra!).

In [20]:
from dask_ml.cluster import SpectralClustering
from dask.distributed import Client
In [21]:
client = Client(processes=False)
client
Out[21]:

Client / Cluster

  • Workers: 1
  • Cores: 8
  • Memory: 17.18 GB

Now we will compute and persist the rescaled data to feed into the ML pipeline. Notice that X has the shape (n_samples, n_features), as discussed above.

In [22]:
X = client.persist(rescaled)
X.shape
Out[22]:
(2483632, 7)
In [23]:
clf = SpectralClustering(n_clusters=4, random_state=0, gamma=None,
                         kmeans_params={'init_max_iter': 5},
                         persist_embedding=True)
In [24]:
%time clf.fit(X)
INFO:dask_ml.cluster.spectral:Starting check array
INFO:dask_ml.cluster.spectral:Finished check array
INFO:dask_ml.cluster.spectral:A: 80.00 kB, (1, 1) blocks
INFO:dask_ml.cluster.spectral:B: 1.99 GB, (1, 1) blocks
INFO:dask_ml.cluster.spectral:A2: 80.00 kB, (1, 1) blocks
INFO:dask_ml.cluster.spectral:B2: 1.99 GB, (1, 1) blocks
INFO:dask_ml.cluster.spectral:V2.1: 79.48 MB, (2, 1) blocks
INFO:dask_ml.cluster.spectral:V2.2: 79.48 MB, (2, 1) blocks
INFO:dask_ml.cluster.spectral:U2.2: 79.48 MB, (2, 1) blocks
INFO:dask_ml.cluster.spectral:U2.3: 79.48 MB, (202, 1) blocks
INFO:dask_ml.cluster.spectral:Persisting array for k-means
INFO:dask_ml.cluster.spectral:k-means for assign_labels[starting]
INFO:root:Starting _check_array
INFO:root:Finished _check_array in 0:00:35.334534
INFO:root:Starting init_scalable
INFO:dask_ml.cluster.k_means:Initializing with k-means||
INFO:dask_ml.cluster.k_means:Starting init iteration  1/ 5 ,  1 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  1/ 5 ,  1 centers in 0:00:01.448212
INFO:dask_ml.cluster.k_means:Starting init iteration  2/ 5 ,  4 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  2/ 5 ,  4 centers in 0:00:01.219986
INFO:dask_ml.cluster.k_means:Starting init iteration  3/ 5 ,  5 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  3/ 5 ,  5 centers in 0:00:01.371860
INFO:dask_ml.cluster.k_means:Starting init iteration  4/ 5 ,  5 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  4/ 5 ,  5 centers in 0:00:01.283345
INFO:dask_ml.cluster.k_means:Starting init iteration  5/ 5 ,  8 centers
INFO:dask_ml.cluster.k_means:Finished init iteration  5/ 5 ,  8 centers in 0:00:01.483907
INFO:root:Finished init_scalable in 0:00:07.637946
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  0.
INFO:dask_ml.cluster.k_means:Shift: 0.1206
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  0. in 0:00:02.512338
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  1.
INFO:dask_ml.cluster.k_means:Shift: 0.0127
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  1. in 0:00:02.127863
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  2.
INFO:dask_ml.cluster.k_means:Shift: 0.0038
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  2. in 0:00:02.276411
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  3.
INFO:dask_ml.cluster.k_means:Shift: 0.0022
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  3. in 0:00:02.140672
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  4.
INFO:dask_ml.cluster.k_means:Shift: 0.0017
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  4. in 0:00:02.367923
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  5.
INFO:dask_ml.cluster.k_means:Shift: 0.0016
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  5. in 0:00:02.402487
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  6.
INFO:dask_ml.cluster.k_means:Shift: 0.0017
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  6. in 0:00:02.283466
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  7.
INFO:dask_ml.cluster.k_means:Shift: 0.0016
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  7. in 0:00:02.397148
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  8.
INFO:dask_ml.cluster.k_means:Shift: 0.0017
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  8. in 0:00:02.214448
INFO:dask_ml.cluster.k_means:Starting Lloyd loop  9.
INFO:dask_ml.cluster.k_means:Shift: 0.0017
INFO:dask_ml.cluster.k_means:Finished Lloyd loop  9. in 0:00:02.330260
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 10.
INFO:dask_ml.cluster.k_means:Shift: 0.0017
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 10. in 0:00:02.208320
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 11.
INFO:dask_ml.cluster.k_means:Shift: 0.0016
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 11. in 0:00:02.231624
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 12.
INFO:dask_ml.cluster.k_means:Shift: 0.0014
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 12. in 0:00:02.561655
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 13.
INFO:dask_ml.cluster.k_means:Shift: 0.0013
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 13. in 0:00:02.332339
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 14.
INFO:dask_ml.cluster.k_means:Shift: 0.0011
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 14. in 0:00:02.484939
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 15.
INFO:dask_ml.cluster.k_means:Shift: 0.0009
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 15. in 0:00:02.192804
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 16.
INFO:dask_ml.cluster.k_means:Shift: 0.0007
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 16. in 0:00:02.356344
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 17.
INFO:dask_ml.cluster.k_means:Shift: 0.0005
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 17. in 0:00:02.209073
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 18.
INFO:dask_ml.cluster.k_means:Shift: 0.0004
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 18. in 0:00:02.463674
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 19.
INFO:dask_ml.cluster.k_means:Shift: 0.0003
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 19. in 0:00:02.155638
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 20.
INFO:dask_ml.cluster.k_means:Shift: 0.0003
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 20. in 0:00:02.332153
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 21.
INFO:dask_ml.cluster.k_means:Shift: 0.0002
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 21. in 0:00:02.280695
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 22.
INFO:dask_ml.cluster.k_means:Shift: 0.0002
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 22. in 0:00:02.578644
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 23.
INFO:dask_ml.cluster.k_means:Shift: 0.0001
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 23. in 0:00:02.200003
INFO:dask_ml.cluster.k_means:Starting Lloyd loop 24.
INFO:dask_ml.cluster.k_means:Shift: 0.0001
INFO:dask_ml.cluster.k_means:Finished Lloyd loop 24. in 0:00:02.480572
INFO:dask_ml.cluster.spectral:k-means for assign_labels[finished]
CPU times: user 2min 12s, sys: 43.5 s, total: 2min 55s
Wall time: 1min 45s
Out[24]:
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=None, kernel_params=None,
          kmeans_params={'init_max_iter': 5}, n_clusters=4,
          n_components=100, n_init=10, n_jobs=1, n_neighbors=10,
          persist_embedding=True, random_state=0)
In [25]:
# Exercise: Open the dask status dashboard and watch the workers in progress.
In [26]:
labels = clf.assign_labels_.labels_.compute()
labels.shape
Out[26]:
(2483632,)
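A quick sanity check on any label vector is to count how many pixels fell into each cluster; the same pattern applies to the `labels` array above (shown here with a short synthetic label vector, since the real one depends on the fit):

```python
import numpy as np

# Stand-in for clf.assign_labels_.labels_.compute()
labels = np.array([3, 3, 0, 1, 1, 1, 2, 0])

values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 2, 1: 3, 2: 1, 3: 2}
```

Wildly unbalanced counts (e.g. one cluster holding nearly every pixel) are often the first sign that rescaling or the choice of n_clusters needs revisiting.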

Un-flattening

Once the computation is done, the output can be used to create a new array with the same structure as the input array. This new output array will carry the coordinates needed to unstack it just as the input was stacked. One of the main benefits of using xarray for this stacking and unstacking is that xarray keeps track of the coordinate information for us.

In [27]:
template = sample_by_feature[:, 0]
template
Out[27]:
<xarray.DataArray (z: 2483632)>
dask.array<shape=(2483632,), dtype=int16, chunksize=(44464,)>
Coordinates:
    band     int64 1
  * z        (z) MultiIndex
  - x        (z) float64 3.371e+05 3.371e+05 3.371e+05 ... 3.371e+05 3.371e+05
  - y        (z) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.33e+06 4.33e+06
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)

NOTE: Since the original array is n_samples by n_features (2483632, 7) and the output contains only one value per sample (2483632,), the template structure for this data needs to have the shape (n_samples,). We achieve this by taking just one of the bands.

In [28]:
output_array = template.copy(data=labels)
output_array
Out[28]:
<xarray.DataArray (z: 2483632)>
array([3, 3, 3, ..., 1, 1, 1], dtype=int32)
Coordinates:
    band     int64 1
  * z        (z) MultiIndex
  - x        (z) float64 3.371e+05 3.371e+05 3.371e+05 ... 3.371e+05 3.371e+05
  - y        (z) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.33e+06 4.33e+06
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)

With this new output array in hand, we can unstack back to the original dimensions:

In [29]:
unstacked = output_array.unstack()
unstacked
Out[29]:
<xarray.DataArray (x: 1564, y: 1588)>
array([[3, 3, 3, ..., 3, 0, 2],
       [3, 3, 3, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [3, 0, 0, ..., 3, 3, 3],
       [0, 2, 0, ..., 3, 3, 3],
       [0, 2, 0, ..., 1, 1, 1]], dtype=int32)
Coordinates:
    band     int64 1
  * x        (x) float64 3.371e+05 3.372e+05 3.372e+05 ... 3.84e+05 3.84e+05
  * y        (y) float64 4.331e+06 4.331e+06 4.331e+06 ... 4.283e+06 4.283e+06
Attributes:
    transform:   (30.0, 0.0, 243285.0, 0.0, -30.0, 4425915.0)
    crs:         +init=epsg:32611
    res:         (30.0, 30.0)
    is_tiled:    0
    nodatavals:  (-9999.0,)
In [30]:
l5_da.sel(band=4).hvplot(x='x', y='y', datashade=True, cmap='greys', width=450, height=400).relabel('Image') + \
unstacked.hvplot(x='x', y='y', datashade=True, cmap='Category10', width=450, height=400).relabel('Clustered')
Out[30]:

Geographic plot

The plot above is useful and quick to generate, but it isn't referenced against the underlying geographic coordinates, which is crucial if we want to overlay the data on any other geographic data sources. Declaring the coordinate reference system in the hvplot call ensures that the data is properly positioned in space. This geo-referencing is straightforward because of the way xarray persists metadata. We can even add map tiles underneath.

In [31]:
import geoviews.tile_sources as gts
In [32]:
gts.ESRI * unstacked.hvplot(x='x', y='y', datashade=True, geo=True, height=500, cmap='Category10')
Out[32]:
In [33]:
# Exercise: Try adding a different set of map tiles. Use tab completion to find others.

Next:

Now that your analysis is complete, you are ready to move on to Data Visualization, where you will learn how to visualize the output of this pipeline and how to diagnose and verify that its inputs have the expected structure.